N-gram: a language independent approach to IR and NLP

Authors

  • P. Majumder
  • M. Mitra
  • B. B. Chaudhuri
Abstract

With the increasingly widespread use of computers and the Internet in India, large amounts of information in Indian languages are becoming available on the web. Automatic information processing and retrieval is therefore becoming an urgent need in the Indian context. Moreover, since India is a multilingual country, any effective approach to IR in the Indian context must be able to handle a multilingual collection of documents. In this paper, we discuss the N-gram approach to developing some basic tools in the area of IR and NLP. This approach is statistical and language-independent in nature, and therefore well suited to the multilingual Indian context. We first present a brief survey of some language-processing applications in which N-grams have been used successfully. We also present the results of some preliminary experiments on using N-grams to identify the language of an Indian-language document, based on a method proposed by Cavnar et al. [1].
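
The abstract does not spell out the language-identification procedure, so the following is only a minimal sketch in the spirit of the rank-order N-gram profile comparison usually associated with Cavnar and Trenkle. Each language is represented by its most frequent character N-grams in rank order, and a document is assigned to the language whose profile is closest under an "out-of-place" rank distance. The function names, the N-gram lengths (1 to 3), the profile size of 300, and the underscore padding are all assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Yield character n-grams of each whitespace-separated token, padded with '_'."""
    for token in text.lower().split():
        padded = f"_{token}_"
        for n in range(n_min, n_max + 1):
            for i in range(len(padded) - n + 1):
                yield padded[i:i + n]


def build_profile(text, top_k=300):
    """Map the top_k most frequent n-grams to their rank (0 = most frequent)."""
    counts = Counter(char_ngrams(text))
    # Sort by descending frequency; ties are broken alphabetically so the
    # ranking is deterministic.
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))[:top_k]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get a fixed penalty."""
    max_penalty = len(lang_profile)
    distance = 0
    for gram, rank in doc_profile.items():
        if gram in lang_profile:
            distance += abs(rank - lang_profile[gram])
        else:
            distance += max_penalty
    return distance


def identify_language(text, lang_profiles):
    """Return the key of the language profile closest to the text's profile."""
    doc_profile = build_profile(text)
    return min(
        lang_profiles,
        key=lambda lang: out_of_place(doc_profile, lang_profiles[lang]),
    )
```

To use the sketch, one would build one profile per language from sample text, for example `profiles = {"english": build_profile(english_text), "hindi": build_profile(hindi_text)}`, and then call `identify_language(document, profiles)`. Because the functions operate on Unicode code points and need no tokenizer, stemmer, or dictionary, the same code applies unchanged to any script, which is the language-independence property the abstract emphasizes.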

Similar resources

Application of Variable Length N-gram Vectors to Monolingual and Bilingual Information Retrieval

Our group in the Department of Informatics at the University of Oviedo has participated, for the first time, in two tasks at CLEF: monolingual (Russian) and bilingual (Spanish-to-English) information retrieval. Our main goal was to test the application to IR of a modified version of the n-gram vector space model (codenamed blindLight). This new approach has been successfully applied to other NL...

Indexing and searching strategies for the Russian language

This paper describes and evaluates various stemming and indexing strategies for the Russian language. We design and evaluate two stemming approaches, a light and a more aggressive one, and compare these stemmers to the Snowball stemmer, to no stemming, and also to a language-independent approach (n-gram). To evaluate the suggested stemming strategies we apply various probabilistic information r...

UniNE at Domain-Specific IR - CLEF 2008: Scientific Data Retrieval: Various Query Expansion Approaches

Our first objective in participating in this domain-specific evaluation campaign is to propose and evaluate various indexing and search strategies for the German, English and Russian languages, in an effort to obtain better retrieval effectiveness than that of the language-independent approach (n-gram). To do so we evaluate the GIRT-4 test-collection using the Okapi, various IR models derived f...

Generating a Distilled N-Gram Set - Effective Lexical Multiword Building in the SPECIALIST Lexicon

Multiwords are vital to better Natural Language Processing (NLP) systems for more effective and efficient parsers, refining information retrieval searches, enhancing precision and recall in Medical Language Processing (MLP) applications, etc. The Lexical Systems Group has enhanced the coverage of multiwords in the Lexicon to provide a more comprehensive resource for such applications. This pape...

A New Method for Improving Computational Cost of Open Information Extraction Systems Using Log-Linear Model

Information extraction (IE) is a process of automatically providing a structured representation from an unstructured or semi-structured text. It is a long-standing challenge in natural language processing (NLP) which has been intensified by the increased volume of information and heterogeneity, and non-structured form of it. One of the core information extraction tasks is relation extraction wh...

Publication date: 2006